OLID-BR: offensive language identification dataset for Brazilian Portuguese
Abstract
Social media has revolutionized the manner in which our society is interconnected. While this extensive connectivity offers numerous benefits, it is also accompanied by significant drawbacks, particularly in terms of the proliferation of fake news and the vast dissemination of hate speech. Identifying offensive comments is a critical task for ensuring the safety of users, which is why industry and academia have been working on developing solutions to this problem. Prior research on hate speech detection has predominantly focused on the English language, with few studies devoted to other languages such as Portuguese. This paper introduces the Offensive Language Identification Dataset for Brazilian Portuguese (OLID-BR), a high-quality NLP dataset for offensive language detection, which we make publicly available. The dataset contains 6,354 (extendable to 13,538) comments labeled using a fine-grained three-layer annotation schema compatible with datasets in other languages, which allows the training of multilingual/cross-lingual models. The five NLP tasks available in OLID-BR allow the detection of offensive comments, the classification of the types of offenses such as racism, LGBTQphobia, sexism, xenophobia, and so on, the identification of the type of target, and the extraction of toxic spans of the comments. All those tasks can enhance the capabilities of content moderation systems by providing deep contextual analysis or highlighting the spans that make the text toxic. We further experiment with and evaluate the dataset using state-of-the-art BERT-based and NER models, which demonstrates the usefulness of OLID-BR for the development of toxicity detection systems for Portuguese texts.
Similar resources
A brazilian portuguese language corpus development
This article presents the techniques that are being used for the creation of a database related to the Brazilian Portuguese language. This database is composed of a collection of recorded voices, from different speakers and different regions of Brazil. The collected voices contain varied phonetic and phonologic information. The applications of this database are diverse, including synthesis and ...
Dermatology and the Brazilian Portuguese language orthographic reform.
The Brazilian Portuguese language orthographic reform has changed the spelling of less than 2% of the lexicon. However, these changes have affected medical practice. In this article, the authors present the main changes in the orthographic rules and gather a group of words whose spelling has been altered by the reform, emphasizing dermatological terms.
Verb Clustering for Brazilian Portuguese
Levin-style classes which capture the shared syntax and semantics of verbs have proven useful for many Natural Language Processing (NLP) tasks and applications. However, lexical resources which provide information about such classes are only available for a handful of the world's languages. Because manual development of such resources is extremely time consuming and cannot reliably capture domain va...
Brazilian Portuguese Words for Design
The University Repository is a digital collection of the research output of the University, available on Open Access. Copyright and Moral Rights for the items on this site are retained by the individual author and/or other copyright owners. Users may access full items free of charge; copies of full text items generally can be reproduced, displayed or performed and given to third parties in any ...
Brazilian Portuguese version of the Patient Competency Rating Scale (PCRS-R-BR): semantic adaptation and validity.
This study describes the adaptation of a revised Brazilian version of the Patient Competency Rating Scale (PCRS-R-BR), which focuses on executive, mnemonic, and attention functions. Evidence of content-based and external validity is also reported. The cross-cultural adaptation was conducted in five phases: 1) translations and back translations; 2) item analysis by authors; 3) classification by ...
Journal
Journal title: Language Resources and Evaluation
Year: 2023
ISSN: 1574-020X, 1574-0218
DOI: https://doi.org/10.1007/s10579-023-09657-0